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Introduction 

This  project  is  designed  to  address  the  subject  of  mammary  cancer  development.  The  purpose  of  the  project  is 
to  investigate  molecular  events  occurring  in  the  preclinical  stages  of  mammary  cancer;  the  results  may  lead  to 
insights  into  cancer  prevention  in  the  future.  Specifically,  the  project  investigates  the  intersection  between 
genome  demethylation,  retrotransposon  transcriptional  activity,  and  retrotransposon-driven  transcription  of 
cellular  genes.  Retrotransposon  promoters  are  well  recognized  to  function  as  alternative  promoters  for  different 
cellular  genes,  generating  chimeric  transcripts  that  may  or  may  not  function  in  the  same  way  as  transcripts  from 
the  regular  gene  promoter.  Transcriptional  activation  of  retrotransposons  is  strongly  linked  with  their  CpG  DNA 
methylation,  and  global  genomic  demethylation  is  one  of  the  commonest  molecular  changes  in  malignancies. 
The  project  tests  the  hypothesis  that,  in  preclinical  stages  of  tumor  development,  progressive  genomic 
demethylation  leads  to  increased  transcriptional  activity  of  retrotransposons  and  this,  in  turn,  leads  to 
transcription  of  otherwise  silent  genes,  potentially  setting  up  molecular  conditions  that  favor  cancer 
development.  Our  collaborator,  Dr.  Anne  Peaston,  developed  a  genetically  engineered  mouse  model  in  which  a 
specific  mammary  cell  population  is  fluorescently  marked  upon  initial  transcriptional  activation  of  the  SV40 
large  T  antigen  (SV40Tag)  oncogene.  SV40Tag  is  transcriptionally  activated  during  pregnancy  and  lactation, 
and  the  mice  are  predisposed  to  develop  mammary  cancer  after  three  pregnancy  and  lactation  cycles  (triparous), 
but  not  after  one  cycle  (uniparous).  Using  this  model,  populations  of  marked  cells  can  be  collected  for 
integrated  analysis  of  gene  expression,  promoter  usage,  and  DNA  methylation  after  defined  amounts  of 
exposure  to  SV40Tag  during  different  stages  of  preclinical  cancer  development. 

Body 

Development  of  New  Computational  Tools 

We  have  developed  the  WIMSi  (Washington  University  Methylation  Signatures)  pipeline  to  enable  us  to  use 
genome-wide  methylation  and  expression  data  to  examine  the  relationship  between  DNA  methylation  and 
expression.  An  initial  description  of  the  method  and  results  can  be  found  in  VanderKraats  et  al,  2013  (a  copy  of 
the  full  manuscript  is  in  the  appendix). 

In  brief,  current  computational  tools  focus  on  methods  to  compute  accurate  methylation  levels  from  the  data, 
methods  to  determine  differentially  methylated  regions,  and  visualization  tools,  such  as  genome  browsers. 
These  approaches  have  elucidated  the  genomic  organization  of  these  marks,  but  they  do  not  sufficiently  address 
how  changes  at  individual  loci  potentially  affect  function.  Correspondingly,  such  methods  find  only  a  modest 
negative  correlation  between  differential  DNA  methylation  at  promoters  and  expression.  While  it  is  possible 
that  DNA  methylation  is  not  a  strong  modifier  of  expression,  it  is  more  likely  that  our  computational  tools  are 
insufficient.  We  hypothesize  that  stronger  associations  are  not  observed  because  existing  analysis  methods 
oversimplify  their  representation  of  the  data  such  that  they  do  not  capture  the  diversity  of  existing  methylation 
patterns.  This  includes  changes  at  the  CpG  island  (region  over-represented  for  CG  dinucleotides)  at  a  gene’s 
TSS,  changes  at  CpG  island  shores  and  the  formation  of  long  partially-hypomethylated  domains.  In  addition, 
there  are  not  methods  to  systematically  search  for  new  patterns. 

We  have  developed  a  new  set  of  tools  for  discovering  differential  methylation  patterns  associated  with 
expression  change  using  genome -wide  high-resolution  methylation  data:  we  represent  differential  methylation 
as  an  interpolated  curve,  or  signature,  and  then  identify  groups  of  genes  with  similarly-shaped  signatures  and 
corresponding  expression  changes.  Our  methods  uncover  a  diverse  set  of  DNA  methylation  patterns  that  are 
conserved  across  a  variety  of  normal  and  cancerous  tissues  and  cell  lines.  Our  initial  implementation  and  results 
are  described  in  VanderKraats  et  al.  2013.  Applying  WIMSi  to  a  variety  of  datasets,  we  have  obtained  the  same 
basic  set  of  methylation  patterns  when  one  compares  stem  cells  to  differentiated  cells  (or  derived  precursors), 
non-tumorigenic  to  breast  tumor  cells,  cancer  cells  of  different  types,  and  different  tissues.  This  supports  the 
notion  that  these  patterns  are  generalizable  and  thus  that  the  mechanisms  of  how  methylation  and  expression 
interact  are  the  same  across  different  biological  contexts. 


We  have  also  applied  WIMSi  in  two  collaborative  projects  to  demonstrate  its  utility.  Using  WIMSi,  we 

4 


discovered  methylation  signatures  associated  with  cellular  senescence  that  are  similar  to  those  found  in  cancer 
(Cruickshanks  et  al  2013).  We  further  found  a  methylation  signature  associated  with  acute  myeloid  leukemia 
that  could  be  reversed  upon  treatment  with  DNA  methyltransferase  inhibitors  (Lund  et  al  2014).  The  WIMSi 
pipeline  is  publically  available  on  SourceForge  at  http://sourceforge.net/projects/wimsi/. 


Gene  List  Generation 

Enumerating  a  list  of  genes  for  which  expression  and 
methylation  changes  are  potentially  linked  is  a  primary 
interest  of  any  genome-wide  methylation  profding 
experiment,  and  is  one  of  the  direct  goals  of  this  project.  The 
WIMSi  method  outlined  above  is  tuned  to  discover  patterns, 
and  thus  produces  a  conservative  gene  list  potentially  prone 
to  false  positives.  Our  initial  publication  describes  one  way  to 
use  this  method  to  generate  a  gene  list,  however,  we  have 
further  refined  our  approach  using  a  set  of  supervised 
machine  learning  methods. 

Our  tool  (DTW-kNN,  Dynamic  Time  Warping  based  k- 
Nearest  Neighbors)  identifies  which  genes  have  associated 
methylation  and  expression  changes  in  a  pair-wise 
experiment  (i.e.  a  single  tumor  sample  compared  to  a  normal 
sample).  We  have  evaluated  our  tool  against  other  methods 
used  in  the  literature  using  whole-genome  bisulfite 
sequencing  data  recently  made  available  from  the  Cancer 
Genome  Atlas  (TCGA)  project  (Fig.  1  and  Table  1).  The 
single  window  classifier  is  a  decision-tree  based  classifier 
built  on  the  average  methylation  level  of  a  single  window 
around  the  gene’s  TSS.  During  training  a  number  of 
parameters  such  a  window  size  and  number  of  CpGs  are  used. 
The  DMR  (Differentially  Methylated  Region)  classifier  uses 
classic  published  definitions  for  DMRs  to  compare  against 
(Hansen  et  al  2012,  VanderKraats  et  al.  2013),  but  optimizes 
all  DMR  parameters  during  training.  The  ROI,  Region  of 
Interest,  classifier,  is  a  random  forest  model  that  uses  the 
average  methylation  of  multiple  gene  regions  (CpG  islands, 
transcription  start  site,  first  exons,  gene  body,  and  last  exon) 
as  inputs  and  is  an  adaptation  of  a  recent  method  developed 
to  predict  absolute  expression  levels  (Lou  et  al.  2014). 


%  Genes  Returned  (1-Rejection  Rate) 

Figure  1.  The  average  accuracy  of  each 
classifier  is  shown  versus  the  average 
percentage  of  genes  returned.  The  percentage 
of  genes  returned  is  equivalent  to  the 
percentage  of  1 -rejection  rate.  Addition  of  a 
rejection  rate  allows  for  the  fact  that  for  some 
genes  methylation  will  not  be  predictive  of 
transcriptional  activity.  Up  and  to  the  right 
defines  better  performance.  The  DTW-kNN 
and  ROI  classifiers  are  evaluated  using  leave- 
one-sample-out  evaluation.  The  Single 
Window  and  DMR  classifiers  were 
purposefully  trained  on  the  testing  data  to 
demonstrate  that  these  models  cannot  capture 
the  underlying  relationships  between  DNA 
methylation  and  expression.  50%  accuracy  is 
equivalent  to  a  random  guess. 


Table  1:  Range  of  genes  returned  per  tumor 
at  95%  accuracy  for  each  classifier.  No  genes 
were  identified  in  any  sample  for  the  ROI 
classifier  at  this  accuracy  level. 


Classifier 


Number  of  Genes 


Our  method  outperforms  commonly  used  methods  in  the 
literature  in  generating  both  more  accurate  and  longer  gene 
lists.  Further,  we  conclude  that  the  single  window  and  DMR 
models  commonly  used  in  the  literature  are  insufficient  to 
capture  the  complexity  of  the  relationships  between  DNA 
methylation  and  expression.  From  Table  1,  at  95%  accuracy, 
we  observe  that  DTW-kNN  returns  55-606  genes  depending 
on  the  tumor  sample,  which  is  much  more  than  the  next  best 
classifier  (Single  Window).  We  also  observe  the 
ineffectiveness  of  the  Region  of  Interest  (ROI)  classifier  in 

the  leave-one-sample-out  paradigm,  underscoring  the  importance  of  regional  methylation  shape  in 
understanding  differential  expression.  These  results  suggest  that  the  number  of  loci  potentially  affected  by  DNA 


DTW-kNN 
Single  Window 
DMR 
ROI 


55-606 

19-46 

4-18 
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methylation  in  primary  breast  tumors  has  been  vastly  understated.  Preliminary  results  from  this  method  were 
presented  at  the  2014  International  Conference  on  Intelligent  Systems  for  Molecular  Biology. 

Our  findings  suggest  that  the  role  of  DNA  methylation  cannot  be  fully  described  by  simply  characterizing  every 
gene  as  “methylated”  or  “unmethylated”.  Using  our  new  method,  we  have  found  and  described  a  variety  of 
methylation  patterns  that  correlate  with  expression  change.  The  true  power  of  this  method  is  in  its  ability  to 
discover  and  separate  distinct  patterns  without  a  priori  knowledge  about  existing  correlations,  which  cannot  be 
accomplished  with  contemporary  approaches.  This  allows  us  to  realize  the  full  potential  of  unbiased  genome¬ 
wide  profiling  of  DNA  methylation  to  reveal  previously  unknown  information  about  methylation’ s  functional 
role.  This  ability  will  be  especially  important  when  examining  regulatory  elements  within  retroelements  as  will 
be  performed  in  this  work. 

One  additional  implication  of  these  results  also  becomes 
clear.  The  simplified  models  used  in  prior  approaches  at  best 
produce  weak  correlations.  However,  if  one  considers  a  more 
formal  description  of  the  underlying  patterns  of  methylation 
changes,  methylation  and  expression  data  are  highly 
correlated.  This  tool  was  designed  to  start  from  a  list  of 
expression  data,  corresponding  transcription  start  sites  (TSSs) 
and  high-resolution  genome -wide  methylation  data  such  as 
from  Methyl-MAPS.  Thus  it  fits  perfectly  into  the  framework 
of  this  proposal  where  we  have  obtained  RNA-seq  expression 
and  Methyl-MAPS  methylation  data  for  each  sample.  We 
have  further  adapted  our  tool  to  address  the  regulation  of 
retrotransposons  promoters  and  we  have  performed  some 
initial  analyses  of  LTR-driven  IncRNAs  as  a  successful 
proof-of-concept. 

Expression  Analysis  of  uni-  and  triparous  mice 
RNA-Seq  analyses  using  long-paired  end  reads  were 
performed  for  two  replicates.  One  “replicate”  consists  of  four 
samples,  or  each  combination  of  uni-  or  triparous,  and  control 
or  SV40+/-  mice.  See  Dr.  Peaston’s  report  for  a  complete 
description  of  the  mouse-breeding  design  and  results.  For  each  sample,  30-46  million  paired  reads  were  aligned 
to  the  mouse  genome  (mmlO)  using  Tophat  (Trapnell  et  al.  2009).  Mapped  reads  were  aligned  to  Ensembl 
genes  using  htseq  and  differential  expression  was  computed  using  edgeR  (Robinson  et  al.  2010).  Genes  that 
were  at  least  2-fold  up  or  down  regulated  with  and  FDR  (False  Discovery  Rate)  <  0.05  were  deemed  significant. 
To  focus  on  genes  that  were  differentially  expressed  in  triparous  mice,  but  were  not  up-regulated  solely  due  to 
the  induction  of  SV40  in  tumor-prone  mice,  we  removed  all  genes  that  were  up-  or  down-regulated  in  uniparous 
mice.  This  left  318  genes  that  were  uniquely  down  and  34  uniquely  up  in  tumor-prone  triparous  mice  (Fig.  2). 
Ontology  analysis  using  DAVID  (Huang  et  al.  2009)  of  down-regulated  genes  indicated  that  these  genes  were 
over-represented  for  genes  involved  in  the  cell  cycle  and  immune  response.  Genes  regulating  cell  cycle  control 
could  represent  the  early  inactivation  of  checkpoints  that  create  a  permissive  environment  for  tumor 
proliferation.  We  are  currently  following  up  on  many  of  these  genes  to  understand  how  they  may  be  involved  in 
setting  the  regulatory  environment  to  facilitate  tumor  development  in  these  mice. 

Surprisingly,  when  we  analyzed  retrotransposon  derived  sequences  we  found  their  expression  dropped  in  tumor- 
prone  triparous  mice  by  an  average  of  ~18%.  This  occurred  across  nearly  every  transposable  element  family. 
We  are  currently  working  to  validate  these  findings.  Analyses  of  chimeric  transcripts  (containing  both 
retroelement  and  coding  sequence)  are  underway. 


Log2(SV40/Ctrl)  Uniparous 


Figure  2.  RNA-seq  analysis  of  SV40+/-  and 
control,  uniparous  and  triparous  mice.  Genes 
differentially  expressed  only  in  SV40+/-  versus 
control  (Ctrl)  triparous  mice  are  indicated  by  red 
dots,  while  genes  whose  expression  changes  by 
similar  amounts  in  both  uni-  and  triparous  mice 
are  indicated  in  green. 
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The  average  methylation 
levels  of  100  kb  bins  is 
shown  in  Fig.  2.  These  data 
indicate  that  methylation 
levels  are  constant  in 
SV40+/-  and  control 
uniparous  mice.  However, 
methylation  levels  appear 
to  drop  (—3.5%)  in  tumor- 
prone  SV40+/-  triparous 
mice  versus  controls.  As 
indicated  in  Fig.  3  and  4, 
this  appears  to  be  a 
genome-wide  phenomena 
and  is  not  consistent  with 
this  decrease  being  caused 
by  long-hypomethylated 
domains  as  has  been 

recently  described  in 

several  tumor  types 

including  breast  (Berman  et 
al  2012,  Hon  et  al  2012). 
Methylation  decreases  were 
found  in  all  genomic 
compartments  including 
genes,  promoters,  and 
repeat  elements.  Meta-gene 
analysis  further  showed  that 
methylation  decreases  in 
tumor-prone  triparous 
mouse  were  found  across 
the  entirety  of  the  gene  (Fig. 
4).  This  decrease  appeared 
to  be  specific  to  the  tumor- 
prone  triparous  mice  and 
could  indicate  that  the 
mechanisms  that  drive  the 
large-scale  genomic 

hypomethylation  observed 
in  breast  tumors  may  be 
active  in  this  early  stage. 


Figure  4.  Meta-gene  analysis  of  Methyl-MAPS  data.  The  average  methylation  is 
plotted  for  well-annotated  RefSeq  genes.  Red  and  yellow  ticks  indicate  regions  whose 
methylation  is  statistically  different  (red  is  FDR  <  10'4,  yellow  is  FDR  <  0.05). 


Methyl-MAPS  Analysis  of  uni-  and  triparous  mice 

Methyl-MAPS  analysis  was  performed  for  one  experimental  replicate.  Methylation  levels  were  determined  for 
~5  million  CpG  sites  per  sample  at  an  average  coverage  level  from  15-27x.  We  encountered  difficulties  during 
the  final  stages  of  library  construction  for  replicate  2.  We  are  working  to  salvage  these  samples  and  perform 
Methyl-MAPS  analysis.  For  replicate  3  it  was  determined  that  there  was  insufficient  material  to  build 
sequencing  libraries  and  thus  the  DNA  has  been  held  for  validation  of  target  regions  identified  from  the  other 
replicates. 


0  0.2  0.4  0.6  0.8  1 

Avg.  Meth.  (lOkb  bins)  Uniparous  Ctrl 


0.2  0.4  0.6  0.8  1 

Avg.  Meth.  (lOkb  bins)  Triparous  Ctrl 


Figure  3.  Correlation  plot  of  average  methylation  levels  computed  using  lOkb 
windows  for  uniparous  (left  panel)  and  triparous  (right  panel)  SV40+/-  and  control 
(Ctrl)  mice. 
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Integrated  Methylation  and  Expression  Analysis 

First  we  performed  a  meta-analysis  to  determine  whether  typical  relationships  between  methylation  and 
expression  in  individual  samples  were  found  in  this  dataset.  As  can  be  seen  in  Fig.  5.  There  is  a  clear  inverse 
correlation  between  methylation  in  the  region  from  500  bp  to  lkb  to  the  3’  side  of  the  TSS  and  expression 
quartile.  This  phenomenon  is  in  line  with  previous  reports  in  tumor  and  normal  cells  (Irizarry  et  al.  2009, 
VanderKraats  et  al.  2013).  We  also  find  a  positive  correlation  between  gene  body  methylation  and  expression 
quartile  as  has  also  been  commonly  observed  (Yang  et  al.  2014). 


Next,  we  used  the  WIMSi  and  DTW-kNN  methods  described  above  to  analyze  relationships  between  DNA 
methylation  and  expression  at  specific  gene  promoters.  We  first  ran  WIMSi,  the  unsupervised  method  to  search 
for  any  patterns  of  methylation  associated  with  transcriptional  changes  at  gene  promoters  that  were  unique  to 
this  dataset.  WIMSi  analysis  failed  to  yield  any  reliable  genes  suggesting  either  that  any  novel  patterns  did  not 
occur  often  enough  or  that  such  patterns  did  not  exist.  Our  previous  analyses  suggested  that  the  dataset  must 
contain  from  15-20  examples  of  a  pattern  to  be  reliably  detected  (VanderKraats  et  al.  2013).  Thus  this  analysis 
would  indicate  there  are  few,  if  any,  genes 
with  correlated  promoter  methylation  and 
expression  levels  in  these  datasets. 


We  next  ran  the  DTW-kNN  method  by 
building  a  model  from  Methy  1-MAPS  and 
RNA-seq  data  from  five  breast  cancer  cell 
lines.  Using  this  model  we  evaluated 
whether  there  were  genes  that  changed  in 
the  SV40  +/-  versus  control  in  both 
uniparous  and  triparous  mice.  We  were 
unable  to  detect  any  genes  even  when  we 
reduced  the  accuracy  cutoff  to  85%.  This 
suggests  that  there  is  not  wide-spread 
promoter  methylation  contributing  to  the 
expression  changes  we  observe  in 
triparous  mice.  This  was  surprising,  since 
in  our  analysis  of  five  breast  tumors  from 
published  TCGA  data,  we  found  between 
55  and  606  genes  with  correlated 
methylation  and  expression  levels. 
Analysis  of  the  methylation  of  promoters 
of  retrotransposon/coding  sequence 
chimeric  transposons  will  be  performed 
with  the  same  pipeline  once  the  chimeric 
transcript  analysis  is  completed. 

Taken  as  a  whole,  while  we  do  identify  a 
small  amount  of  genomic  hypomethylation 
in  tumor-prone  triparous  mice,  these 
changes  do  not  appear  to  be  associated 
with  increased  transcriptional  activity  at 
either  retrotransposons  or  coding  genes.  It 
also  appears  that  the  promoter  methylation 
changes  that  are  common  in  primary 
tumors  are  not  found  in  this  early  stage. 
We  conclude  that  the  mechanisms  that 
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cause  demethylation  may  be  active  and  may  create  a  permissive  environment  for  later  transcriptional  changes  as 
tumors  arise.  However,  these  methylation  changes  do  not  appear  to  affect  transcription  of  retrotransposons  nor 
coding  genes  at  this  stage.  In  the  future,  we  will  characterize  methylation  and  expression  from  tumors  arising  in 
tumor-prone  triparous  mice  to  further  understand  when  the  large-scale  remodeling  of  DNA  methylation  occurs 
and  how  these  methylation  changes  facilitate  transcriptome  remodeling. 
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Statement  of  Work 

The  relevant  sections  from  the  Statement  of  Work  are  shown  in  the  table  below  with  corresponding  goals  and 
results.  We  are  in  the  process  of  finalizing  our  analyses  and  performing  validation  of  specific  hypotheses  this 
work  has  generated.  We  hope  to  have  a  manuscript  prepared  and  submitted  detailing  our  findings  of  the  analysis 
of  uniparous  and  tumor-prone  triparous  mice  in  the  next  six  months. 


Year  1  &  2:  Items  from  Statement  of  Work  Relevant  to  Edwards  Lab. 


Year, 

Months 

Goal 

Result 

Y1 

4. 

•  Set  up  schedule  for  formal  monthly 

•  Completed 

1-3 

9. 

electronic  lab  meeting  between  Peaston  lab 

4-6 

5. 

and  Edwards  lab.  And  regularly  hold 

7-9 

7. 

meetings. 

10-12 

Y2 

8. 

1-3 

8. 

4-6 

9,10,11- 

7-9 

8. 

10-12 

Y1 

6. 

•  Preliminary  Methyl-MAPS  analysis  of  pilot 

•  This  material  was  never  received.  Based  on 

4-6 

virgin  samples 

delays  in  the  mouse  breeding  and 
conversations  with  the  Dr.  Peaston,  it  was 
decided  to  move  straight  to  the  primary 
analyses. 

Y1 

7. 

•  Methyl-MAPS  library  preparation  and 

•  Completed 

10-12 

sequencing  for  replicate  #1  uniparous  & 
triparous  control  and  tumor-prone 

Y2 

6. 

•  Preliminary  Methyl-MAPS  analysis  of 

•  Completed 

1-3 

7. 

replicate  #1 

•  Review  DNA  loci  of  interest  for  bisulfite 

•  Initial  review  completed.  Bisulfite  sequencing 

sequencing  in  light  of  PCR  results  and 
library  analysis,  Continue  bisulfite 
sequencing  assessment  of  DNA  loci  of 
interest 

is  commencing. 

Y2 

3. 

•  Methyl-MAPS  library  preparation  and 

•  Material  was  received.  We  encountered 

4-6 

sequencing  for  replicate  #2  uniparous  & 

problems  with  library  construction  due  to  the 

5. 

triparous  control  and  tumor-prone 

low  amount  of  DNA  provided.  We  are 

•  Library  analyses  replicate  #2  and 

working  to  salvage  these  samples  and  plan  to 

preliminary  comparisons  with  replicate  #1 

sequence  shortly. 

Y2 

4. 

•  Methyl-MAPS  library  preparation  and 

•  Material  was  received.  Based  on  results  from 

7-9 

sequencing  for  replicate  #3  uniparous  & 

replicate  #2,  there  is  insufficient  material  to 

triparous  control  and  tumor-prone 

construct  the  libraries  for  sequencing.  A 
strategic  decision  was  made  to  hold  the 
material  for  validation,  rather  than  attempt 
library  construction. 

Y2 

2. 

•  Finalize  data  analysis 

•  Analysis  pipelines  have  been  developed  and 

10-12 

3. 

•  Prepare  a  report  for  publication  of  the  results 

are  in  place.  Initial  analysis  is  complete. 

5. 

Deeper  analysis  will  be  finished  shortly. 

Reports  will  be  finalized  once  the  analyses 
and  validation  are  finished. 
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Key  Research  Accomplishments 

•  We  developed  new  computational  tools  to  combine  genome -wide  expression  and  methylation  data  to 
output  a  list  of  genes  where  methylation  likely  contributes  to  their  silencing  or  activation. 

•  We  performed  genome -wide  methylation  and  expression  profding  of  SV40+/-  and  control  uni-  and  tri- 
parous  mice. 

•  We  found  a  small  amount  of  genomic  demethylation  specific  to  tumor-prone  triparous  mice  that  was  not 
accompanied  with  corresponding  changes  in  the  transcrip  tome.  These  changes  indicate  that  the 
mechanisms  driving  demethylation  in  breast  tumors  may  have  activated,  even  though  they  may  not  yet 
contribute  to  transcriptome  remodeling  at  this  early  stage. 
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Reportable  Outcomes 

Manuscripts 

Vanderkraats  ND,  Hiken  JF,  Decker  KF,  Edwards  JR.  (2013)  Discovering  high-resolution  patterns  of 
differential  DNA  methylation  that  correlate  with  gene  expression  changes.  Nucleic  Acids  Res.  41:6816-6827. 

Lund  K,  Cole  JJ,  VanderKraats  ND,  McBryan  T,  Pchelintsev  NA,  Clark  W,  Copland  M,  Edwards  JR,  Adams 
PD.  (2014)  DNMT  inhibitors  reverse  a  specific  signature  of  aberrant  promoter  DNA  methylation  and  associated 
gene  silencing  in  AML.  Genome  Biol.  15:406. 

Cruickshanks  HA,  McBryan  T,  Shah  PP,  Nelson  DM,  Donahue  G,  VanderKraats  ND,  Edwards  JR,  Berger  SL, 
Adams  PD.  (2013)  Features  of  the  cancer  epigenome  are  acquired  as  primary  human  cells  approach  senescence. 
Nature  Cell  Biology.  15:1495-1506. 

Boulard  M,  Edwards  JR,  Bestor  TH.  (2015)  “FBXL10  protects  Polycomb-bound  genes  from 
Hypermethylation.”  Nat.  Genetics.  In  press. 


Invited  Manuscripts 

Bestor  TH,  Edwards  JR,  Boulard  M.  (2014)  Notes  on  the  role  of  dynamic  DNA  methylation  in  mammalian 
development.  PNAS.  Published  ahead  of  print  November  3,  2014,  doi:10.1073/pnas.l415301111. 


Abstracts 

Schlosberg  CE,  VanderKraats  ND,  Hiken  JF,  Weinberger  KQ,  Ju  T,  Edwards  JR.  (2014)  "Modeling  complex 
patterns  of  differential  DNA  methylation  that  strongly  associate  with  gene  expression  changes."  22nd  Annual 
International  Conference  on  Intelligent  Systems  for  Molecular  Biology,  July  13-15. 
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Conclusion 

The  major  conclusion  from  this  work  is  that  at  this  early  stage  of  tumor  development,  we  find  only  a  small 
amount  of  genome-wide  hypomethylation  in  tumor-prone  triparous  mice  and  do  not  observe  widespread  DNA 
methylation  changes  that  are  commonly  found  in  tumors.  Correlated  methylation  changes  at  promoters  and 
expression  changes  are  found  at  hundreds  of  genes  in  primary  human  tumors,  but  none  were  observed  in  the 
tumor-prone  mice  in  this  study.  Likely  the  mechanisms  that  lead  to  the  genome-wide  hypomethylation  observed 
in  primary  tumors  have  activated,  but  they  are  not  impacting  transcription  of  retrotransposons  nor  coding  genes 
at  this  early  stage. 

Expression  analyses  indicate  that  the  tumor-prone  triparous  mice  experience  altered  transcription  of  several 
hundred  transcripts.  Many  of  these  transcripts  are  found  in  cell  cycle  control  and  could  represent  the  early 
inactivation  of  checkpoints  to  control  proliferation.  There  does  not  appear  to  be  wide-spread  methylation  or 
expression  changes  in  retrotransposon  sequences,  indicating  that  these  likely  play  little  to  no  role  in  this  stage. 

During  the  course  of  this  proposal  we  have  developed  a  new  set  of  computation  tools  that  potentially  have  broad 
significance  beyond  the  work  here.  The  computational  tools  we  have  developed  are  designed  to  work  with 
annotated  genes  as  we  have  outlined,  but  can  also  be  expanded  to  any  transcriptional  unit  with  a  known  TSS 
and  known  expression  value.  Further  we  have  demonstrated  how  they  can  be  directly  used  to  study  which  genes 
are  correlated,  and  thus  potentially  regulated,  by  DNA  methylation  in  cancer  cells.  We  have  also  shown  how  we 
can  use  these  tools  to  identify  genes  that  are  potentially  up-regulated  in  direct  response  to  demethylating  agents 
such  as  5-azacitidine  and  Decitabine.  While  DNA  demethylating  agents  have  primarily  been  employed  to  treat 
Myelodysplastic  syndrome  and  AML,  recently  these  drugs  have  been  explored  as  potential  therapeutics  in  solid- 
tumors  such  as  lung  and  breast  cancer.  We  believe  this  work  has  the  potential  to  shed  light  on  which  patients 
have  genes  that  can  be  potentially  re-activated  by  these  drugs,  and  thus  which  patients  can  potentially  benefit 
from  this  line  of  therapy. 
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ABSTRACT 

Methylation  of  the  CpG-rich  region  (CpG  island) 
overlapping  a  gene’s  promoter  is  a  generally 
accepted  mechanism  for  silencing  expression. 
While  recent  technological  advances  have  enabled 
measurement  of  DNA  methylation  and  expression 
changes  genome-wide,  only  modest  correlations 
between  differential  methylation  at  gene  promoters 
and  expression  have  been  found.  We  hypothesize 
that  stronger  associations  are  not  observed 
because  existing  analysis  methods  oversimplify  their 
representation  of  the  data  and  do  not  capture  the  di¬ 
versity  of  existing  methylation  patterns.  Recently, 
other  patterns  such  as  CpG  island  shore  methylation 
and  long  partially  hypomethylated  domains  have  also 
been  linked  with  gene  silencing.  Here,  we  detail  a  new 
approach  for  discovering  differential  methylation 
patterns  associated  with  expression  change  using 
genome-wide  high-resolution  methylation  data:  we 
represent  differential  methylation  as  an  interpolated 
curve,  or  signature,  and  then  identify  groups  of  genes 
with  similarly  shaped  signatures  and  corresponding 
expression  changes.  Our  technique  uncovers  a 
diverse  set  of  patterns  that  are  conserved  across  em¬ 
bryonic  stem  cell  and  cancer  data  sets.  Overall,  we 
find  strong  associations  between  these  methylation 
patterns  and  expression.  We  further  show  that  an 
extension  of  our  method  also  outperforms  other 
approaches  by  generating  a  longer  list  of  genes 
with  higher  quality  associations  between  differential 
methylation  and  expression. 

INTRODUCTION 

DNA  methylation  is  an  important  factor  in  transcrip¬ 
tional  regulation,  playing  a  role  in  genomic  imprinting, 
X-inactivation,  retrotransposon  silencing  and  the  control 


of  tissue-specific  genes  during  differentiation  (1).  DNA 
methylation  patterns  are  frequently  altered  in  tumors 

(2) ,  and  there  is  great  interest  in  understanding  how 
changes  to  these  patterns  contribute  to  human  disease 

(3) .  Even  so,  how  alterations  to  DNA  methylation  affect 
gene  transcription  remains  poorly  characterized.  Over 
60%  of  genes  have  a  CpG-rich  region,  termed  a  CpG 
island,  overlapping  their  promoter  (4).  Classically,  it  is 
thought  that  hypermethylation  of  promoter-associated 
CpG  islands  silences  transcription.  However,  it  was 
recently  shown  that  cancer-  and  tissue-specific  methyla¬ 
tion  variation  in  adjacent  regions,  termed  CpG  island 
shores,  is  also  associated  with  gene  expression  change 

(5) .  Additionally,  genes  are  more  likely  to  be  repressed 
when  they  are  located  in  partially  methylated  domains 

(6)  or  long  hypomethylated  domains  (7,8)  in  cancer. 

Techniques  such  as  whole-genome  bisulfite  sequencing 

(WGBS)  (9)  and  Methyl-MAPS  (10)  have  recently  been 
developed  to  map  methylation  at  single-base  resolution 
genome-wide.  Methods  to  interpret  this  data,  however, 
are  lacking.  Current  computational  techniques  are 
mostly  concerned  with  the  visualization  of  genome-level 
correlations  between  DNA  methylation  and  other  epigen¬ 
etic  marks,  or  with  the  identification  of  regions  that  are 
differentially  marked  between  samples  [recently  reviewed 
in  (11)].  These  tools  have  elucidated  the  genomic  organ¬ 
ization  of  these  marks,  but  they  do  not  sufficiently  address 
how  changes  at  individual  loci  associate  with  and  poten¬ 
tially  affect  function. 

The  most  common  approach  for  characterizing  methy¬ 
lation  changes  between  two  samples  uses  a  sliding  window 
to  identify  differentially  methylated  regions  (DMRs)  (7,9). 
A  gene  with  a  hypermethylated  DMR  near  its  promoter  is 
assumed  to  exhibit  a  decrease  in  expression,  while  a  gene 
near  a  hypomethylated  DMR  should  exhibit  an  increase  in 
expression.  In  practice,  the  Pearson  correlation  coefficient 
between  the  methylation  level  of  the  DMR  and  the  expres¬ 
sion  of  its  associated  gene  is  around  —0.3  (11,12).  It  has 
been  assumed  that  better  anticorrelation  is  precluded 
owing  to  noise  from  experimental  error,  mixed  cellular 
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populations,  copy  number  variations,  chromatin  modifiers 
or  other  regulation  events.  Another  explanation,  however, 
is  that  contemporary  analysis  methods  are  not 
sophisticated  enough  to  recognize  relationships  involving 
more  complex  methylation  patterns.  Existing  approaches 
summarize  their  representation  of  methylation  change  in 
the  promoter  region  to  simplify  analysis,  but  this  sacrifices 
potentially  important  spatial  information  contained  in  the 
locations  of  the  constituent  sites. 

To  discover  DNA  methylation  changes  that  associate 
with  gene  expression  changes,  we  propose  a  new  method 
that  uses  the  entire  differential  methylation  profile  in  the 
vicinity  of  a  gene’s  promoter  (Figure  1).  We  represent  the 
differential  methylation  for  a  fixed  area  around  each 
gene’s  transcription  start  site  (TSS)  as  a  continuous 
curve,  or  signature,  capturing  the  shape  of  the  methylation 
changes.  We  then  apply  a  curve  similarity  metric,  the 
discrete  Frechet  distance,  to  compare  differential  methy¬ 
lation  signatures  for  all  genes.  Using  an  unsupervised  clus¬ 
tering  technique,  we  arrange  the  signatures  according  to 
their  shapes  and  identify  which  clusters  of  genes  exhibit 
statistically  significant  changes  in  expression.  Generalized 
patterns  of  differential  methylation  can  be  extrapolated 
from  the  resulting  clusters.  Because  the  approach  is  un¬ 
supervised,  no  assumptions  need  to  be  made  about  the 
direction  of  a  correlation  between  methylation  and  expres¬ 
sion.  Although  designed  for  pattern  discovery,  the 
method  is  easily  extended  to  identify  a  list  of  genes  poten¬ 
tially  regulated  by  methylation.  These  gene  lists  are  of 
markedly  greater  length  and  have  higher  quality  associ¬ 
ations  between  differential  methylation  and  expression 
than  those  generated  by  existing  methods.  A  current 
implementation  of  the  approach  can  be  found  on  the 
project  website  at  http://epigenomics.wustl.edu/WIMSi/. 

MATERIALS  AND  METHODS 

Methylation  signatures 

Using  WGBS  or  Methyl-MAPS  data  (single-base  reso¬ 
lution)  to  compare  methylation  patterns  between  different 
genes’  promoter  regions  is  complicated  by  the  variability 
in  the  locations  of  their  CpGs.  To  enable  comparisons 
between  genes,  we  standardize  the  representation  of  dif¬ 
ferential  methylation  data  for  each  gene  by  creating  a 
methylation  signature  across  a  fixed-width  region 
centered  at  the  TSS  (Figure  1).  In  a  single  sample,  the 
methylation  level  at  each  probed  CpG  is  represented  as 
a  continuous  value  between  0  and  1,  denoting  fully 
unmethylated  and  methylated,  respectively.  To  compare 
methylation  between  two  samples,  we  subtract  these 
values  at  each  site  to  produce  a  differential  methylation 
score  that  ranges  from  —1  to  1,  denoting  complete 
hypomethylation  and  hypermethylation,  respectively. 

We  create  a  gene’s  methylation  signature  on  a  fixed- 
width  target  region  by  interpolating  the  differential  methy¬ 
lation  scores  using  piecewise  cubic  Hermite  interpolation 
(Figure  IB  and  C)  across  all  CpG  sites  within  the  region 
and  between  one  and  five  CpG  sites  flanking  the  region  on 
either  side.  Because  methylation  is  highly  correlated  over 
short  distances  (Supplementary  Figure  SI)  (13), 


interpolation  provides  a  suitable  estimate  for  differential 
methylation  in  areas  with  missing  data.  Such  missing  data 
points  can  occur  owing  to  insufficient  coverage  or  experi¬ 
mental  limitations  (e.g.  data  collected  only  at  specific  re¬ 
striction  sites).  Interpolated  values  always  lie  between  —1 
and  1.  Regions  with  fewer  than  five  CpG  sites  with  suffi¬ 
cient  coverage  and  regions  with  fewer  than  one  flanking 
site  per  side  are  discarded.  We  also  discard  regions  con¬ 
taining  no  CpG  sites  with  absolute  differential  methyla¬ 
tion  >0.2.  Lastly,  we  apply  Gaussian  smoothing  on  the 
curves  (a  =  50  bp)  to  help  moderate  noise  due  to  experi¬ 
mental  artifacts  such  as  missing  or  inaccurate  measure¬ 
ments.  One  advantage  of  using  a  combination  of 
interpolation  and  smoothing  is  that  it  improves  perform¬ 
ance  of  the  method  for  low  coverage  data.  Previously  it 
was  shown  that  statistical  smoothing  was  an  effective 
method  to  analyze  low  coverage  WGBS  data  (14).  The 
resulting  methylation  signatures  for  each  gene  are 
bounded  between  —1  and  1  on  a  fixed  region  relative  to 
the  TSS. 

Determining  clusters  of  methylation  signatures  with 
significant  expression  changes 

We  compare  methylation  signatures  between  genes  using 
the  discrete  Frechet  distance,  also  known  as  the  coupling 
distance  (15,16).  The  Frechet  distance  is  informally  known 
as  the  dog-man  distance  because  it  represents  the 
minimum  length  of  leash  necessary  for  a  person  traversing 
one  curve  to  walk  a  dog  along  another,  assuming  neither 
party  is  allowed  to  walk  backwards.  The  Frechet  metric  is 
advantageous  because  it  is  efficiently  computable,  while 
still  taking  into  account  the  entire  course  of  the  curves. 
In  particular,  it  appeals  to  an  intuitive  notion  of  similarity 
between  methylation  signatures  in  that  two  curves  with 
similar  shape  will  have  a  low  distance,  even  if  one  curve 
is  shifted  slightly  from  the  other  relative  to  the  TSS 
(Supplementary  Figure  S2).  Because  Frechet  distance  is 
calculated  in  Euclidean  space,  a  scaling  parameter  must 
specify  the  relationship  between  the  x-axes  of  the  curves, 
in  bp,  and  the  y-axes,  in  differential  methylation  level. 
Intuitively,  this  parameter  controls  how  far  peaks  in  the 
y-direction  are  allowed  to  slide  across  the  x-axis  and  still 
be  identified  as  similar  between  two  genes. 

Formally,  a  methylation  signature,  or  curve, 
can  be  described  as  a  continuous  mapping 
f :  [0,1]  [0,nb]  x  [—1,1]  where  is  the  fixed  length  of 

the  region  in  base  pairs.  For  two  curves  A  and  B,  the 
Frechet  distance  is  defined  as  follows: 

&Frechet{A,B)  =  inf  ma  xd(A(a(t)),B(P(t))) 

a,p  t 

where  te[ 0,1],  d(x,y )  is  the  Euclidean  distance  between  x 
and  y,  and  a(t)  and  f$(t)  are  continuous,  monotonically 
increasing  functions  from  [0,1]  to  [0,1]  such  that 
of(0)  =  0)  =  0  and  a(l)  =  /3(1)  =  1.  For  any  two  differ¬ 

ential  methylation  signatures,  we  calculate  similarity 
using  the  coupling  distance,  which  can  be  computed  on 
polygonal  curves  in  quadratic  time  using  a  dynamic 
programming  algorithm  (16).  Each  continuous 
interpolated  curve  is  converted  into  a  polygonal  curve 
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Figure  1.  Method  overview  and  example  methylation  signatures.  (A)  Overview  of  the  approach  for  associating  spatially  similar  DNA  methylation 
changes  with  corresponding  changes  in  transcription.  (B,  C)  Example  methylation  signatures  from  HMEC-HCC1954  WGBS  data  for  the  tumor 
suppressor  gene  CDH4  and  the  LPCAT3  gene.  Conversion  of  methylation  data  to  signatures  allows  direct  comparison  of  the  data  between  genes 
with  different  distributions  of  CpG  sites  relative  to  their  TSSs.  Top  panels  show  methylation  (blue  is  unmethylated  and  red  is  methylated)  and  RNA- 
Seq  expression  data  on  the  UCSC  genome  browser.  Bottom  panels  show  interpolated  and  smoothed  methylation  signatures  (black  curve)  that  are 
used  to  calculate  the  discrete  Frechet  distance.  Blue  tick  marks  show  locations  of  all  CpG  sites.  Black  dots  mark  experimentally  measured  differences 
in  methylation  between  the  two  samples. 


by  averaging  every  10  bp,  yielding  ceiling  1 10)  vertices  for 
the  nb  bases  in  the  target  region.  For  sparsely  sampled 
areas,  this  fixed  resolution  greatly  limits  the  discretization 
error  between  our  discrete  Frechet  distance  and  the  con¬ 
tinuous  version  because  this  error  is  bounded  by  the 
maximum  edge  length  for  each  pair  of  curves  (17). 
Averaging  every  10  bp  allows  the  algorithm  to  run  faster 
than  sampling  every  2  bp.  Because  the  averaging  interval  is 
considerably  smaller  than  the  width  of  the  Gaussian 
smoothing  kernel,  it  has  little  effect  on  the  resulting  dis¬ 
tances.  We  compared  the  results  of  clustering  for  several 
experiments  using  10  bp  averaging  and  dinucleotide 
sampling,  and  found  no  differences  in  the  type  and 
number  of  patterns  found  for  the  HMEC-HCC1954 
data  set.  By  default,  scaling  between  the  x-  and  y-axes 
was  set  such  that  2500  bp  along  the  x-direction  was 
equivalent  to  one  unit  of  differential  methylation  in  the 
y-direction.  We  tested  a  large  number  of  values  for  this 
ratio.  Manual  inspection  of  resulting  clusterings  showed 
that  the  final  set  of  patterns  discovered  was  not  sub¬ 
stantially  altered  despite  moderate  changes  to  the  scaling 
ratio. 

Methylation  signatures  are  arranged  using  unsupervised 
complete-linkage  agglomerative  hierarchical  clustering 
based  solely  on  the  discrete  Frechet  distance  between 


curve  pairs.  After  clustering,  we  introduce  the  differential 
expression  data  to  identify  clusters  of  genes  with  similar 
expression  differences.  Expression  data  can  be  obtained 
through  any  method,  including  RNA-Seq  and  expression 
array  analysis.  To  focus  on  genes  for  which  differential 
methylation  and  expression  could  be  related,  we  consider 
only  genes  with  greater  than  2-fold  expression  change. 

We  identify  clusters  from  the  dendrogram  where  methy¬ 
lation  is  significantly  associated  with  expression  as 
follows.  For  each  cluster,  we  evaluate  the  likelihood  that 
the  observed  group  of  expression  values  is  atypical 
compared  with  the  distribution  of  expression  values  over 
the  entire  training  set  [a  dendrogram  on  ns  signatures 
contains  exactly  (ns- 1)  clusters].  We  determine  significance 
using  a  two-sample  Kolmogorov-Smirnov  test  between 
the  set  of  differential  expression  values  in  each  cluster 
and  the  entire  population  of  expression  values.  The  false 
discovery  rate  is  controlled  at  0.05  using  the  Benjamini- 
Hochberg  procedure.  This  estimate  of  significance  is  con¬ 
servative  because  one  sample  is  a  subset  of  the  other. 

After  all  statistically  significant  clusters  are  identified, 
we  select  a  set  of  nonoverlapping  clusters  using  an  iterative 
algorithm  to  define  the  trade-off  between  grouping  similar 
patterns  together  versus  breaking  them  into  distinct 
clusters.  We  seek  a  balance  between  the  selection  of 
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larger  clusters  that  embody  more  general  patterns  and  the 
conformity  between  differential  expression  values,  called 
purity.  The  purity  of  a  set  of  genes  is  defined  as  the 
fraction  of  genes  that  have  expression  change  in  the 
same  direction  as  the  majority.  The  full  process  is 
described  in  the  Supplementary  Methods.  We  used  a 
minimum  purity  of  0.85  unless  otherwise  stated. 

Generating  a  gene  list 

To  produce  a  list  of  genes  with  associated  differential 
methylation  and  expression,  we  ran  the  discovery  method 
on  a  set  of  overlapping  5  kb  regions  centered  at  a  fixed  set 
of  locations  around  the  TSS  (Figure  6A).  The  scaling  factor 
between  the  x-  and  y-dimensions,  minimum  cluster  purity 
and  all  interpolation  parameters  were  the  same  as  previ¬ 
ously  described.  We  used  22  regions,  spanning  an  area 
from  [-25  kb,  25  kb]  relative  to  the  TSS.  The  area 
[—10  kb,  10  kb]  relative  to  the  TSS  was  covered  more 
densely  because  this  was  where  the  majority  of  relevant 
differential  methylation  features  were  observed  by  the 
pattern  discovery  tool.  The  leftmost  boundaries  (in  kb, 
relative  to  the  TSS)  were:  —25,  —20,  —15,  —10,  —9,  —8, 
-7,  -6,  -5,  -4,  -3,  -2,  -1,  0,  1,  2,  3,  4,  5,  10,  15  and  20. 
For  each  region,  we  recorded  the  set  of  genes  identified  as 
positives  (i.e.  changing  in  the  same  direction  as  the  majority 
of  their  cluster).  Genes  that  are  identified  for  at  least  m 
regions  are  added  to  the  final  list.  Unless  otherwise 
stated,  an  m  of  two  regions  was  used. 

RESULTS 

Pattern  discovery  in  high-resolution  methylation  data 

We  evaluated  our  technique  using  three  primary  and  nine 
additional  comparisons  (17  total  data  sets)  with  high-reso¬ 
lution  methylation  and  RNA-Seq  expression  data 
(Supplementary  Table  SI).  The  first  primary  data  set  was 
WGBS  of  nontumorigenic  human  mammary  epithelial  cells 
(HMEC)  and  breast  cancer  cells  (HCC1954)  (6)  containing 
methylation  data  for  84.7%  of  genomic  CpGs  with  a 
coverage  level  of  at  least  10  in  each  sample 
(Supplementary  Table  SI,  GEO:  GSE29127).  We  focused 
many  of  the  analyses  in  this  article  on  this  HMEC- 
HCC1954  data  set  because  it  has  high  coverage  and 
contains  examples  of  all  the  methylation  patterns  we  dis¬ 
covered  (Supplementary  Data  1).  To  examine  how  the 
method  performed  on  a  lower  coverage  data  set,  we 
examined  WGBS  data  for  HI  embryonic  stem  (ES)  cells 
and  IMR90  fetal  lung  fibroblasts  (9)  (GEO:  GSE16256). 
While  the  genomic  coverage  level  of  this  data  was  high, 
the  data  was  sparsely  sampled  at  promoters:  <40%  of 
CpGs  had  coverage  of  at  least  10  (Supplementary  Figure 
S3).  By  including  all  CpGs  with  coverage  as  low  as  a  single 
read,  the  data  covered  93.5%  of  genomic  CpGs.  However, 
methylation  scores  are  expected  to  be  less  accurate  in 
regions  with  lower  sampling.  WGBS  data  was  processed 
and  methylation  scores  computed  as  in  (6).  Analysis  was 
limited  to  only  CpG  methylation  (complete  results  are  in 
Supplementary  Data  2).  Lastly,  to  validate  our  findings 
using  data  from  an  alternative  experimental  method,  we 
generated  Methyl-MAPS  data  from  MCF7  and  T47D 


breast  cancer  cells.  Methyl-MAPS  uses  methylation- sensi¬ 
tive  and  -dependent  restriction  enzyme  digests  followed  by 
high-throughput  sequencing  to  identify  methylation  levels 
at  individual  CpGs  (10).  Libraries  were  constructed, 
sequenced  and  analyzed  as  in  (10)  (see  Supplementary 
Methods  for  further  details).  We  limited  our  analysis  to 
sites  interrogated  by  both  digests,  which  included  24.9% 
of  genomic  CpGs  with  coverage  of  at  least  five  to  ensure 
an  adequate  number  of  CpGs  with  data  around  each 
promoter.  Expression  data  for  each  sample  came  from 
poly(A)  selected  RNA-Seq  experiments  (see 
Supplementary  Methods  for  further  details).  All  Methyl- 
MAPS  and  RNA-Seq  data  are  available  from  GEO,  acces¬ 
sion  GSE45337.  Complete  results  are  in  Supplementary 
Data  3. 

In  addition  to  these  three  primary  comparisons,  we 
applied  our  method  to  WGBS  and  RNA-Seq  data 
comparing  H9  ES  cells  (18,19)  to  IMR90  cells,  HI  cells 
to  four  HI -derived  differentiated  cell  types,  female 
adipose-derived  stem  cells  (ADS)  to  ADS-derived  adipo¬ 
cytes  and  ADS-derived  induced  pluripotent  stem  cells 
(ADS-iPSCs)  (19)  and  primary  mouse  ES  cells  to 
isolated  sperm  and  oocytes  (20). 

We  selected  an  initial  region  for  the  discovery  of  differ¬ 
ential  methylation  patterns  that  associate  with  expression 
changes  based  on  two  criteria.  First,  average  CpG  density 
increases  roughly  2  kb  upstream  and  downstream  of  the 
TSS,  suggesting  that  sites  in  this  region  may  have  regula¬ 
tory  importance  (21).  Second,  increased  variability  of  dif¬ 
ferential  methylation  in  CpG  island  shores,  defined  as  the 
regions  up  to  2  kb  away  from  a  CpG  island,  has  been 
linked  to  differential  expression  (5).  To  search  for 
patterns  across  this  entire  area,  we  chose  a  conservative 
initial  region  of  10  kb  centered  on  the  TSS.  Using  a  cluster 
purity  threshold  of  0.85,  we  identified  27  clusters  that  were 
significantly  correlated  with  differential  expression,  con¬ 
taining  519  genes,  in  the  HMEC-HCC1954  data  set 
(Figure  2;  complete  clustering  results  are  in 
Supplementary  Data  1).  A  cartoon  depiction  of  each  of 
the  patterns  observed  is  shown  in  Figure  3.  Applying  our 
method  to  the  H1-IMR90  and  MCF7-T47D  Methyl- 
MAPS  data  sets  showed  that  our  method  could  still 
identify  clusters  corresponding  to  each  of  the  patterns  dis¬ 
covered  in  the  higher  quality  HMEC-HCC1954  data  set 
even  with  low  promoter  coverage  or  substantially  reduced 
sampling  (Supplementary  Data  2  and  3).  It  is  likely  that 
interpolation  and  Gaussian  smoothing  are  helpful  for 
analyzing  low  coverage  data.  Limiting  HMEC-HCC1954 
data  to  only  sites  probed  by  Methyl-MAPS  showed  the 
data  reduction  had  no  impact  on  the  ability  to  detect  each 
of  the  identified  patterns.  As  a  negative  control,  we 
randomly  scrambled  the  expression  values  for  all  genes 
in  each  data  set;  any  cluster  identified  as  significant  was 
a  false  positive.  For  1000  random  permutations,  our  tech¬ 
nique  identified  a  false-positive  cluster  in  1.7-2. 3%  of  the 
experiments  (Supplementary  Table  S2). 

Patterns  overlapping  the  TSS 

From  the  resulting  sets  of  significant  clusters,  we  sought  to 
characterize  the  common  features  of  the  methylation 
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Figure  2.  Example  of  clustering  methylation  signatures  from  HMEC- 
HCC1954  WGBS  data.  (A)  Complete  dendrogram  for  clustering  3566 
methylation  signatures.  Clusters  highlighted  in  orange,  magenta  and 
cyan  indicate  significant  clusters  with  purity  of  at  least  0.75,  0.85  and 
0.95,  respectively.  Subclusters  featured  in  (B,  C)  are  indicated  with 
arrows.  A  heat  map  of  expression  data  is  plotted  in  bars  alongside 
the  dendrogram.  F.C.  denotes  expression  fold  change;  green  indicates 
downregulation,  red  indicates  upregulation.  (B,  C)  Left  side  shows  the 
complete  cluster  with  boxes  to  indicate  expression.  On  the  right  side, 
the  top  panels  are  cartoons  depicting  the  relevant  pattern;  the  bottom 
panels  show  example  signatures  from  subclusters  of  a  significant 
HMEC-HCC1954  cluster.  Clustering  was  performed  on  10  kb  regions. 

(B)  Subcluster  showing  a  pattern  of  methylation  increase  across  the 
TSS,  similar  to  the  classic  methylation  of  a  CpG  island  across  the 
TSS.  Patterns  from  the  entire  cluster  are  in  Supplementary  Figure  S4. 

(C)  Subcluster  showing  a  pattern  defined  by  decrease  in  methylation  3' 
of  the  TSS,  similar  to  a  methylation  change  at  a  CpG  island  shore.  The 
entire  cluster  is  in  Supplementary  Figure  SI  1. 


A  Expression  Down 


Expression  Up 


TSSli  l- 

TSS2i 

LONGOi 

Expression  Up 


PROXli 

PROX2i 

PROX3i 

TSS 

Overlapping 

Patterns 


TSS 

Proximal 

Patterns 


Figure  3.  Summary  of  the  diverse  patterns  found  in  the  data  sets 
analyzed.  One  version  of  each  pattern  is  shown  to  the  left  with  its 
corresponding  inverted  pattern  to  the  right.  (A)  Patterns  that  overlap 
the  TSS  (TSS1,  TSS2  and  LONGO),  and  their  respective  inverted 
versions  (TSSli,  TSS2i,  and  LONGOi).  (B)  Patterns  proximal  to  the 
TSS  all  include  a  region  of  change  downstream  of  the  TSS  and  are 
separated  by  their  upstream  differences.  Shown  are  PROX1,  PROX2, 
PROX3  and  their  respective  inverted  versions:  PROXli,  PROX2i  and 
PROX3i. 


signatures  that  may  be  responsible  for  the  observed  rela¬ 
tionships  with  expression  change.  As  expected,  many 
clusters  contained  patterns  with  a  region  of  strong  hyper- 
or  hypomethylation  spanning  the  TSS  that  negatively 
correlated  with  expression  change  (Figure  2B).  Further  in¬ 
spection  of  HMEC-HCC1954  data  revealed  several  distinct 
patterns  overlapping  the  TSS.  After  rerunning  our  method 
using  methylation  signatures  based  on  a  30  kb  region 
centered  at  the  TSS,  three  distinct  patterns  emerged 
(Figure  3A):  a  hypermethylated  region  at  the  TSS  sur¬ 
rounded  by  long  hypomethylated  domains  (TSS1;  Figure 
4A),  a  hypermethylated  region  at  the  TSS  set  in  a  region 
with  invariant  methylation  levels  (TSS2;  Figure  2B)  and  a 
pattern  of  long  hypomethylated  domains,  but  with  no 
change  in  methylation  at  the  TSS  (LONGO;  Figure  4F). 
The  rarest  pattern,  LONGO,  was  also  observed  in  the  H9- 
IMR90  (Supplementary  Figure  S7)  and  H1-IMR90 
(Supplementary  Data  2)  comparisons.  Hypomethylated 
domains  associated  with  these  patterns  (TSS1,  LONGO) 
extend  up  to  several  Mb  in  both  directions  (Figure  4C). 
In  HMEC-HCC1954  data,  we  found  a  hypomethylated 
region  at  the  TSS  set  in  a  hypermethylated  region  (TSSli; 
Figure  4B).  While  not  all  patterns  were  observed  in  every 
comparison  we  analyzed,  the  overall  set  of  patterns  was 
common  across  all  data  sets.  For  instance,  MCF7-T47D 
and  H1-IMR90  data  sets  show  an  inverted  TSS2  pattern 
(TSS2i;  Supplementary  Data  2  and  3).  IMR90  fibroblasts 
exhibit  a  TSS1  pattern  relative  to  HI  stem  cells 
(Supplementary  Data  2),  although  the  hypomethylated 
domains  in  fibroblasts  are  much  smaller  (Figure  4E).  It 
has  been  suggested  that  the  observed  positive  correlation 
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Figure  4.  Analysis  of  patterns  overlapping  the  TSS.  (A,  B)  Top  panel  is  a  cartoon  depicting  the  relevant  pattern.  Bottom  panel  shows  an  example 
subcluster  from  an  HMEC-HCC1954  cluster.  Clustering  was  performed  on  30  kb  regions.  F.C.  denotes  expression  fold  change;  green  indicates 
downregulation,  red  indicates  upregulation.  (A)  Subcluster  characterized  by  hypermethylation  at  the  TSS  set  in  a  long  hypomethylated  domain 
(TSS1).  The  entire  cluster  is  in  Supplementary  Figure  S6.  (B)  Subcluster  characterized  by  hypomethylation  at  the  TSS  set  in  a  long  hypermethylated 
domain  (TSSli).  The  entire  cluster  is  in  Supplementary  Figure  S8.  (C)  Average  differential  methylation  and  CpG  density  for  genes  from  clusters 
identified  as  exhibiting  three  TSS  patterns  (TSS1,  TSSli  and  TSS2).  Example  signatures  for  each  of  the  three  patterns  are  shown  in  parts  (A),  (B)  and 
Figure  2B.  (D)  Alu  elements  and  RefSeq  genes  are  depleted  in  the  regions  around  genes  with  the  TSS1  pattern.  No  enrichment  or  depletion  of 
other  repeats  was  found  (Supplementary  Figure  S9).  (E)  The  TSS1  pattern  is  also  found  in  the  H1-IMR90  comparison  (Supplementary  Data  2). 
(F)  Example  subcluster  showing  the  LONGO  pattern  from  the  HMEC-HCC1954  comparison.  The  entire  cluster  is  in  Supplementary  Figure  S6. 


between  gene-body  methylation  and  expression  may  be  due 
to  similar  domains  (8).  Inspection  of  individual  genes  with 
the  LONGO  pattern  revealed  that  promoters  were 
hypomethylated  in  both  samples,  leading  us  to  speculate 
that  long  hypomethylated  domains  could  contribute  to 
gene  silencing,  possibly  through  the  recruitment  of 
factors  that  lead  to  the  formation  of  repressive  chromatin 
(6).  Clusters  representing  the  inverse  patterns — a 
hypomethylated  region  at  the  TSS  set  in  either  long 
hypermethylated  or  invariant  regions — were  also 
observed  (TSSli,  TSS2i;  Figure  4B). 

Patterns  proximal  to  the  TSS 

In  addition  to  patterns  with  differential  methylation  across 
the  TSS,  we  identified  multiple  clusters  in  all  data  sets 
characterized  by  a  change  in  methylation  downstream  of 
the  TSS  (Figure  2C,  Figure  5A  and  B).  Analysis  of  all  genes 
in  clusters  associated  with  this  pattern  indicated  that  it  pre¬ 
dominantly  occurs  within  3  kb  of  the  TSS  (Figure  5C).  This 
y  pattern  was  observed  in  several  distinct  clusters  due  to 
variations  in  other  parts  of  the  differential  methylation 
curves  (see  examples  in  Figure  5A  and  B).  The  TSS- 
proximal  patterns  found  by  our  discovery  method  are  in 
agreement  with  the  variation  in  methylation  found  at  CpG 
island  shores.  Interestingly  though,  we  find  no  significant 
association  between  the  proximal  methylation  patterns  and 
whether  promoters  are  classified  as  CpG-rich  or  -poor 
(Supplementary  Figure  S10).  This  may  suggest  that  these 
proximal  methylation  changes  are  not  confined  to  island 


shores,  or  it  may  be  due  to  the  fact  that  the  patterns  we 
discover  are  anchored  to  the  TSS. 

Although  some  clusters  with  3'  patterns  also  displayed 
methylation  change  upstream  of  the  TSS,  no  independent 
relationship  was  observed  between  a  5'  pattern  and  differ¬ 
ential  expression  (Figure  5A  and  B).  Differential  methyla¬ 
tion  downstream  of  the  TSS  was  consistently  linked  with 
expression  change,  regardless  of  upstream  hyper-  or 
hypomethylation.  To  further  probe  whether  distinct  cor¬ 
relative  patterns  occur  upstream  of  the  TSS,  we  ran  our 
method  using  signatures  defined  on  the  region  from  —5  kb 
to  the  TSS.  The  majority  of  identified  clusters  were 
characterized  by  changes  in  methylation  at  the  TSS. 
Genes  in  clusters  with  methylation  changes  5'  of  the  TSS 
often  had  correlative  3'  changes  as  well.  We  found  no 
clusters  in  any  of  the  data  sets  that  supported  a  convincing 
association  between  5'  changes  and  expression.  Examining 
5  kb  regions  shifted  downstream  of  the  TSS  discovered 
more  patterns  than  those  upstream  of  the  TSS,  consistent 
with  the  existence  of  the  3'  pattern  (Figure  5D)  and 
absence  of  a  5'  pattern.  From  the  cumulative  evidence, 
we  speculate  that  CpG  island  shore-like  regions 
upstream  of  the  TSS  may  not  be  independently  associated 
with  expression  changes.  Furthermore,  while  we  observed 
that  differential  methylation  patterns  across  the  TSS  were 
generally  associated  with  gene  silencing,  3'  methylation 
patterns  correlated  more  often  with  downregulation  than 
complete  silencing  (Supplementary  Figure  SI  2).  3' 

hypermethylation  events  have  previously  been  shown  to 
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Figure  5.  Patterns  containing  3'  methylation  changes  are  inversely  correlated  with  expression,  independent  of  changes  57  of  the  TSS.  (A,  B)  Top 
panel  is  a  cartoon  depicting  the  relevant  pattern.  Bottom  panel  shows  an  example  subcluster  from  the  default  HMEC-HCC1954  clustering  on  a  10  kb 
region.  Entire  clusters  for  both  parts  are  displayed  in  Supplementary  Figure  SI  1.  (A)  Subcluster  exhibiting  genes  with  a  decrease  in  methylation  on 
the  3'  side  of  the  TSS  and  an  increase  in  methylation  5'  of  the  TSS  and  with  a  significant  increase  in  expression  (PROX3i).  (B)  Subcluster  exhibiting 
genes  with  an  increase  in  methylation  on  both  3'  and  5;  sides  of  the  TSS  and  with  a  significant  decrease  in  expression  (PROX2).  (C)  Meta-gene 
analysis  of  genes  from  significant  clusters  whose  dominant  pattern  was  identified  as  including  a  3/-proximal  component  shows  that  the  3'  pattern  is 
typically  confined  to  within  3  kb  of  the  TSS.  From  top  to  bottom,  the  three  panels  depict  averages  for  HMEC-HCC1954,  MCF7-T47D  and  Hl- 
IMR90  data  sets.  (D)  The  number  of  new  genes  identified  increases  downstream  of  the  TSS,  but  not  upstream.  We  started  by  identifying  genes  in 
significant  clusters  for  a  5  kb  region  centered  at  the  TSS.  We  iteratively  moved  the  5  kb-wide  region  upstream  of  the  TSS  and  identified  all  genes  in 
significant  clusters  not  found  previously.  This  process  was  repeated  for  each  of  the  region  positions  used  by  the  gene  list  tool.  The  entire  process  was 
then  repeated  for  downstream  of  the  TSS. 


affect  expression  (22),  which  supports  the  idea  that  the  3' 
changes  we  observe  could  be  potentially  functional. 

Genomic  features  associated  with  particular  DNA 
methylation  patterns 

Gene  ontology  and  expression  signature  analysis  (23)  of 
the  genes  found  to  have  correlated  differential  methylation 
and  expression  showed  that  there  was  enrichment  for 
cancer-related  genes  in  HMEC-HCC1954  data,  while 
there  was  enrichment  for  genes  associated  with  differenti¬ 
ation  in  H1-IMR90  data.  Long  hypomethylated  and  par¬ 
tially  methylated  domains  in  cancer  cells  were  previously 
observed  to  have  distinct  sequence  contexts  (6,7).  We  find 
that  genes  with  the  TSS1  pattern  are  associated  with  a 
depletion  of  CpG  density,  Alu  elements  and  gene  density 
relative  to  all  gene  promoters  (Figure  4C  and  D).  These 
depletions  are  observed  in  regions  ranging  from  10  kb  up 
to  0.5  Mb  from  the  TSS.  In  summary,  genes  exhibiting  the 
TSS1  pattern  have  the  same  distinct  sequence  properties  as 
genes  found  in  long  hypomethylated  domains  (7). 

There  has  been  significant  interest  in  addressing  whether 
specific  sequences  direct  or  inhibit  methylation  at 


particular  promoters  or  CpG  islands  (5,24-26).  One  pos¬ 
sibility  is  that  different  sequences  may  direct  different 
methylation  patterns.  A  preliminary  search  for  motifs 
associated  with  each  discovered  pattern  did  not  find  a  sig¬ 
nificant  enrichment  for  any  novel  or  known  motifs 
associated  with  one  pattern  more  than  another.  In 
addition,  we  find  no  significant  association  between  each 
of  the  different  observed  methylation  patterns  and 
whether  promoters  are  classified  as  CpG-rich  or  CpG- 
poor  (Supplementary  Figure  S10). 

Quantifying  the  sensitivity  of  pattern  discovery 

Because  the  true  patterns  of  differential  methylation  that 
correlate  with  expression  change  are  unknown,  an  obvious 
question  is  whether  we  have  detected  all  correlative  methy¬ 
lation  patterns  in  the  data  set.  To  address  this  issue  and 
quantify  the  method’s  ability  to  recover  genes  from  known 
correlative  patterns,  we  created  a  model  to  simulate  dif¬ 
ferential  methylation  signatures  and  expression  values.  A 
complete  description  of  the  methods  for  simulated  data  is 
in  the  Supplemental  Methods. 
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Simulated  genes  with  a  predefined  correlation  between 
methylation  and  expression  were  added  into  two  kinds  of 
background  data  sets:  randomly  generated  simulated 
genes  with  no  correlation  with  expression  (Supplementary 
Figure  SI 3— SI 5)  and  real  data.  Using  wholly  simulated 
data  we  explored  the  types  of  patterns  our  method  could 
discover.  We  successfully  detected  a  variety  of  introduced 
patterns  including  peaks  of  differential  methylation  with 
fixed  widths  and  locations  relative  to  the  TSS 
(Supplementary  Figure  SI 6),  peaks  with  varying  location 
(Supplementary  Figure  SI 7)  and  peaks  with  different  vari¬ 
ances  in  height  (Supplementary  Figure  SI 8).  More 
complex  patterns  were  more  easily  detected,  presumably 
because  highly  constrained  curves  are  more  likely  to  have 
similar  Frechet  distances. 

We  also  explored  the  method’s  ability  to  discover  new 
patterns  in  a  background  of  real  methylation  data.  We 
introduced  three  simulated  patterns  (Supplementary 
Figure  SI 9)  into  HMEC-HCC1954  WGBS  data.  We 
varied  the  number  of  genes  added,  and  measured  both 
the  ability  to  discover  the  patterns  and  the  number  of 
genes  correctly  identified  (Supplementary  Figure  S20). 
We  found  that  the  patterns  were  easily  discoverable 
when  at  least  50-100  simulated  genes  were  introduced, 
depending  on  the  particular  simulated  pattern. 
Performance  was  impacted  when  one  of  the  simulated 
patterns  was  too  similar  to  the  existing  patterns 
(Supplementary  Figure  S21).  Furthermore,  we  found 
that  the  fraction  of  simulated  genes  required  to  reliably 
detect  a  pattern  decreased  as  the  number  of  genes  in  the 
data  set  increased  (Supplementary  Figure  S22).  This 
suggests  that  one  could  detect  rare  patterns  by  simply 
concatenating  multiple  data  sets. 

Enumerating  genes  with  correlative  methylation  signatures 

Generating  a  list  of  genes  for  which  expression  and  methy¬ 
lation  changes  are  potentially  linked  is  a  primary  interest 
of  any  genome- wide  methylation  profiling  experiment. 
The  method  described  above  is  tuned  to  discover 
patterns,  not  to  create  a  gene  list.  Individual  genes 
within  good  clusters  will  sometimes  be  false  positives, 
and  other  genes  will  be  excluded  because  their  clusters 
do  not  meet  the  high  purity  threshold.  To  produce  a 
better  gene  list,  we  executed  our  method  on  a  set  of 
overlapping  5  kb  regions  centered  at  a  fixed  set  of  loca¬ 
tions  around  the  TSS  (Figure  6A).  Comparison  of  gene 
lists  from  H1-IMR90  replicate  WGBS  data  shows  good 
concordance  between  replicates.  To  determine  the  extent 
to  which  genes  are  incorrectly  included  owing  to  the 
creation  of  errant  clusters,  we  randomly  scrambled  the 
expression  values  in  the  HMEC-HCC1954  dendrogram. 
For  1000  such  experiments  using  default  clustering  par¬ 
ameters,  only  seven  experiments  returned  any  false-posi¬ 
tive  genes:  six  reported  a  single  false  positive  and  a  seventh 
returned  two.  We  also  tested  the  ability  of  the  gene  list 
method  to  recover  introduced  simulated  genes  (see 
‘Quantifying  the  sensitivity  of  pattern  discovery’  section 
above).  For  several  simulated  patterns,  we  found  that  the 
resultant  gene  lists  included  nearly  all  simulated  genes 
when  50-100  were  present  (Supplementary  Figure  S20). 


Comparison  to  other  approaches 

We  compared  the  quality  of  the  gene  lists  produced  by  our 
approach  to  lists  constructed  by  two  commonly  used 
methods.  For  the  DMR  approach,  regions  of  differential 
methylation  are  defined  between  two  samples  using  a 
sliding  window.  DMRs  are  coupled  to  a  particular  gene 
using  a  cutoff  for  the  distance  between  the  DMR  and  the 
gene’s  TSS  (7,9).  For  the  promoter-based  approach,  a  fixed 
window  around  each  gene’s  TSS  defines  the  gene’s 
promoter.  If  methylation  changes  substantially  within 
this  window,  the  gene  is  labeled  as  differentially  methylated 
(10,27).  We  optimized  DMR-  and  promoter-based 
approaches  for  each  data  set  using  69  360  and  6174  param¬ 
eter  choices,  respectively  (Supplementary  Figure  S23  and 
Supplementary  Table  S3),  while  using  a  single  common  set 
of  parameters  for  our  approach  across  all  data  sets. 

Judging  the  quality  of  the  gene  lists  is  difficult  because 
there  is  no  experimental  gold  standard  data  set  for  which 
the  relationship  between  methylation  at  specific  CpG  sites 
and  expression  is  well  known  for  all  genes.  A  correlation 
coefficient  has  often  been  used  to  quantify  the  association 
between  methylation  and  expression.  When  applied  to  a  list 
of  genes  for  which  methylation  is  predicted  to  associate 
with  expression  change,  however,  the  correlation  coefficient 
only  judges  one  part  of  the  method’s  performance.  For 
example,  by  varying  its  parameters,  the  DMR  method 
can  produce  short  gene  lists  with  strong  correlations  or 
long  lists  with  weak  correlations.  To  compare  methods, 
we  directly  examine  the  trade-off  between  the  total 
number  of  differentially  expressed  genes  identified  as  poten¬ 
tially  correlated  versus  the  fraction  of  identified  genes  that 
are  actually  correlated  in  the  predicted  direction  (Figure  6). 
This  trade-off  is  somewhat  analogous  to  comparing  the  rate 
of  total  positives  to  the  rate  of  true  positives,  with  the  true¬ 
negative  and  false-negative  rates  being  unknown.  On  the 
basis  of  these  criteria,  our  approach  clearly  outperforms 
DMR-  and  promoter-based  methods.  For  example, 
consider  the  HMEC-HCC1954  data  (Figure  6B).  When 
the  DMR  method  is  examined  at  a  level  where  it  correctly 
associates  the  direction  of  expression  change  92%  of  the 
time,  it  returns  only  26  genes  (0.7%  of  all  differentially 
expressed  genes).  Our  approach  produces  a  list  of  461 
genes  (13%)  at  a  correct  association  rate  of  95%.  With 
less  restrictive  criteria,  the  DMR  method  gives  a  list  of 
717  genes  (20%)  at  a  correct  association  rate  of  71%. 
Our  technique  gives  a  list  of  roughly  the  same  size  (761 
genes)  at  a  correct  association  rate  of  93%. 

Table  1  presents  results  from  each  of  the  different  data 
sets  we  analyzed  and  includes  data  from  primary  cells  and 
cell  lines.  All  analyses  were  performed  using  the  default 
parameters.  These  results  imply  that  prior  approaches 
have  underestimated  the  strength  of  the  relationship 
between  differential  methylation  and  expression.  With  a 
better  model  of  the  underlying  patterns  of  methylation 
change,  it  is  clear  that  methylation  and  expression  data 
are  highly  associated. 

Gene  lists  for  low  coverage  data  sets 

We  next  sought  to  examine  how  our  approach  performed 
with  suboptimal  data.  Our  technique  returned  similar 
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Figure  6.  Gene  lists  generated  by  our  method  are  of  markedly  greater  length  and  quality  than  those  generated  by  alternative  methods.  (A)  Schematic 
of  our  process  for  generating  gene  lists.  (B-D)  Comparison  of  gene  lists  generated  using  our  approach  with  those  from  optimized  DMR  and 
promoter-based  methods  for  (B)  HMEC-HCC1954  WGBS,  (C)  IMR90-H1  WGBS  and  (D)  MCF7-T47D  Methyl-MAPS  data.  The  plot  shows  the 
trade-off  between  the  number  of  genes  identified  as  being  associated  with  differential  expression  based  on  their  methylation  (x-axis)  and  the  quality  of 
the  associations  (y-axis),  which  is  the  fraction  of  identified  genes  for  which  the  direction  of  expression  change  matches  the  expected  direction  based 
on  methylation.  Points  up  and  to  the  right  indicate  better  performance;  50%  quality  is  equivalent  to  random  guessing.  Only  optimal  parameter 
choices  with  an  inverse  correlation  between  methylation  and  expression  are  shown  for  DMR-  and  promoter-based  approaches  (see  Supplementary 
Figure  S23  for  further  information  about  this  optimization).  Promoter-based  approaches  were  optimized  across  both  all  CpG  sites  (All  Sites)  and  all 
significant  CpG  sites  (All  Sig.  Sites).  The  minimum  number  of  required  windows  m  was  varied  in  our  method  from  2  to  5  to  show  the  effects  of 
increased  stringency. 


Table  1.  Summary  of  the  numbers  of  genes  identified  by  the  gene  list  tool  for  each  comparison 


Sample  Comparison 

Differentially 

expressed 

genes 

CpGs  with  methylation 
difference  >0.3 

Genes  with  correct 
association 

Genes  with 

incorrect 

association 

Fraction  of  genes 
with  correct 
association 

HMEC  -  HCC1954 

3584 

6979  726 

1150 

118 

91% 

HI  -  IMR90 

2571 

7210721 

608 

68 

90% 

MCF7  -  T47D 

3740 

1  973  564 

664 

57 

92% 

ADS  -  ADS-Adipose 

2937 

332489 

138 

15 

90% 

ADS  -  ADS-iPSC 

3803 

5917387 

1124 

93 

92% 

HI  -  HI -Mesenchymal 

3714 

2170  686 

559 

40 

93% 

HI  -  HI -Neural  Progenitor 

2546 

829  297 

79 

3 

96% 

HI  -  H1-BMP4 

4103 

737  003 

65 

3 

96% 

HI  -  Hl-Mesendoderm 

2353 

765  051 

59 

5 

92% 

H9-IMR90 

5875 

7  669  782 

605 

58 

91% 

oocyte  -  ES  cell  (mouse) 

4727 

1204  883 

334 

25 

93% 

sperm  -  ES  cell  (mouse) 

4580 

4  364748 

1027 

104 

91% 
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results  for  the  low  coverage  IMR90-H1  comparison, 
which  has  no  minimum  coverage  cutoff,  while  DMR- 
and  promoter-based  methods  struggled  (Figure  6C).  We 
also  ran  experiments  on  the  HMEC-HCC1954  sample  pair 
to  test  the  method’s  performance  on  downsampled  data. 
First,  we  removed  raw  sequence  reads  at  random,  finding 
that  WGBS  data  obtained  with  an  average  coverage  as 
low  as  seven  resulted  in  little  loss  in  our  method’s  ability 
to  identify  genes  (Supplementary  Figure  S24A).  We  also 
removed  methylation  scores  at  random  from  the  mapped 
data,  and  found  that  94.5%  of  genes  were  still  detected 
when  50%  of  the  CpGs  were  removed  (Supplementary 
Figure  S24B).  Additionally,  our  method  outperforms  the 
optimized  DMR-  and  promoter-based  methods  for  the 
MCF7-T47D  sample  pair  collected  using  Methyl- 
MAPS,  which  contained  37.4%  of  the  CpG  sites  from 
the  10  kb  region  centered  at  the  TSS.  These  results 
suggest  that  our  method  is  robust  in  the  context  of 
missing  or  low  coverage  data. 


DISCUSSION 

A  summary  of  the  patterns  discovered  in  the  data  sets 
analyzed  is  presented  in  Figure  3.  One  primary  advance 
of  our  approach  is  in  its  use  of  spatial  information.  Several 
of  the  patterns  we  found  demonstrate  the  need  for  using 
spatial  information  about  specific  CpG  sites  when  trying 
to  connect  differential  methylation  to  expression  change. 
For  instance,  the  LONGO  pattern  is  positively  correlated 
with  expression,  while  the  3/-hypomethylation  pattern, 
commonly  seen  in  conjunction  with  a  S'-hypomethylation 
event  (PROX2i),  is  negatively  correlated.  Given  methyla¬ 
tion  marks  from  one  of  these  two  cases,  it  is  clear  that  the 
spatial  information  about  specific  CpG  locations  is  neces¬ 
sary  to  successfully  determine  the  direction  of  expression 
change.  This  may  help  explain  an  observation  made 
during  an  analysis  of  82  methylation  data  sets  from 
human  tissues  and  cell  lines  using  reduced  representation 
bisulfite  sequencing  (12).  The  authors  observed  that 
methylation  changes  in  CpG  island  sites  greater  than 
2  kb  downstream  were  sometimes  positively  correlated 
with  expression  and  sometimes  negatively  correlated. 
Based  on  our  findings,  one  explanation  for  their  observa¬ 
tion  is  that  they  are  observing  a  mixture  of  genes  with 
LONGO  and  3'  patterns. 

We  also  find  that  despite  variation  throughout  the 
vicinity  of  a  gene’s  promoter,  expression  change  is  often 
well  correlated  with  only  the  methylation  changes  in  a 
confined  area  relative  to  the  TSS.  This  point  is  well  sup¬ 
ported  by  the  many  examples  of  the  3/-proximal  pattern, 
which  is  seen  with  a  variety  of  methylation  activity 
upstream  including  hypermethylation,  hypomethylation 
or  little  activity  (Figure  3B).  DMR-based  methods  are 
generally  tuned  to  find  wider  regions  than  what  we 
observe  in  our  3/-proximal  clusters,  to  better  identify 
genes  with  classical  differential  methylation  at  a  CpG 
island  promoter  (e.g.  TSS2  and  TSS2i).  While  these 
methods  can  be  tuned  to  locate  narrower  DMRs,  this 
would  result  in  more  cases  where  contradictory  DMRs 
exist  near  the  TSS  (e.g.  one  hypermethylated  DMR  and 


one  hypomethylated  DMR,  such  as  PROX3  in  Figure  3B). 
In  these  cases,  there  is  no  obvious  way  to  decide  which 
DMR  may  be  influencing  transcription  without 
introducing  pre-learned  spatial  information.  As  another 
example,  the  DMR  approach  has  difficulty  discriminating 
between  the  LONGO  pattern  that  is  associated  with  a 
decrease  in  expression  and  the  PROXli  and  PROX2i 
patterns  that  are  associated  with  an  increase  in  expression. 
When  set  to  find  long  regions  downstream  of  the  TSS, 
DMRs  can  be  identified  with  positive  correlation 
between  expression  and  methylation.  When  set  to  find 
short  regions,  DMRs  can  be  identified  with  negative  cor¬ 
relations.  However,  no  single  set  of  DMR  parameters  can 
easily  simultaneously  capture  each  differential  methyla¬ 
tion  pattern  and  its  correlation  with  expression.  These 
examples  also  underscore  a  fundamental  limitation  of 
the  DMR  method:  it  cannot  be  used  to  discover  new 
patterns,  but  can  only  search  for  a  limited  set  of  relation¬ 
ships  that  are  already  known  to  exist. 

Spatial  information  has  proven  useful  to  the  analysis  of 
other  epigenetic  data  sets  as  well.  Recently,  a  model  con¬ 
sidering  the  spatial  locations  of  chromatin  marks  was  used 
to  train  a  support  vector  machine  to  predict  chromatin 
signals  at  transcription  factor  binding  sites  (28).  The 
authors  divided  each  fixed  region  into  50  bp  bins,  each 
of  which  was  used  as  a  feature  in  a  vector  for  classifica¬ 
tion.  The  success  of  this  approach  demonstrates  the  po¬ 
tential  of  spatial  models  to  better  capture  the  nature  of  the 
epigenetic  data  than  is  possible  with  a  simple  window- 
based  approach.  However,  for  analyzing  DNA  methyla¬ 
tion,  it  is  important  to  account  for  the  relationship 
between  such  bins  rather  than  treating  them  as  independ¬ 
ent  features.  For  instance,  in  Figure  2C,  the  topmost  and 
bottommost  example  curves  have  a  similar — but  slightly 
shifted — 3/-proximal  decrease  in  methylation  and  are  both 
upregulated.  If  we  used  predefined  nonoverlapping  bins 
here,  they  would  either  be  so  narrow  that  the  top  peak 
and  bottom  peak  lie  in  different  bins,  or  so  wide  that  the 
salient  feature  of  the  curve  is  diminished.  Because  our 
shape-similarity  approach  tolerates  some  movement  of 
peaks  in  the  x-direction,  features  such  as  these  are  natur¬ 
ally  associated  with  one  another. 

Finally,  our  findings  have  implications  for  how  DNA 
methylation  should  be  assayed  to  preserve  all  potentially 
useful  information.  Using  downsampled  WGBS  data  and 
Methyl-MAPS  data  we  demonstrated  that  we  can  discover 
the  full  set  of  patterns  from  Figure  3  without  needing 
values  at  all  CpG  sites,  as  long  as  they  are  removed  in  a 
relatively  unbiased  manner  (Supplementary  Figure  S24, 
Supplementary  Data  3).  However,  information  from 
some  sites  may  be  more  informative  than  others.  Many 
experimental  methods  attempt  to  assess  a  gene’s  methyla¬ 
tion  state  by  restricting  analysis  to  only  CpG-rich  regions, 
or  to  only  a  handful  of  CpG  sites  in  and  around  the 
promoter.  It  is  unclear  whether  such  techniques  measure 
methylation  at  the  correct  sites  to  discriminate  the  full 
spectrum  of  methylation  patterns  that  correlate  with  tran¬ 
scriptional  changes.  As  additional  single-base  resolution 
genome-wide  DNA  methylation  data  sets  become  avail¬ 
able,  we  can  explore  which  subsets  of  individual  CpGs 
most  inform  promoter  methylation  patterns  and  thus 
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need  to  be  experimentally  measured  to  accurately  assess 
changes  in  a  gene’s  methylation  state. 


CONCLUSIONS 

Our  findings  suggest  that  characterizing  gene  promoters 
simply  as  ‘methylated’  or  ‘unmethylated’  is  insufficient.  By 
considering  the  entire  set  of  methylation  changes  near  the 
promoter,  we  found  and  described  a  variety  of  methyla¬ 
tion  patterns  that  correlate  with  expression  change.  The 
power  of  our  method  is  its  ability  to  discover  and  separate 
distinct  patterns  without  any  prior  knowledge  about 
existing  relationships,  which  cannot  be  accomplished 
with  contemporary  approaches.  This  allows  us  to  use  the 
full  potential  of  unbiased  genome-wide  profiling  of  DNA 
methylation  to  reveal  previously  unknown  information 
about  methylation’ s  functional  role.  Although  we 
applied  our  method  on  regions  of  various  widths  around 
the  TSS,  all  correlative  patterns  except  those  associated 
with  the  long  hypomethylated  domains  were  found 
within  5  kb  of  the  TSS  (Figure  5D).  Interestingly,  all 
patterns  found  in  the  cancer  cell  data  sets  were  also 
found  in  the  ES  and  iPSC  cell  data  sets.  While  5' 
upstream  patterns  are  observed,  these  appear  to  occur 
due  to  correlations  with  the  3'  downstream  patterns.  By 
appropriately  capturing  the  diverse  set  of  methylation 
patterns  that  exist,  we  observe  a  high  level  of  association 
between  changes  in  a  gene’s  methylation  state  and  changes 
in  its  expression. 

A  strength  of  the  technique  described  here  is  its  poten¬ 
tial  for  expansion  to  examine  more  general  epigenetic 
modifications.  We  have  confined  our  analysis  to  data 
from  genome-wide  single-base  resolution  methods  such 
as  WGBS  or  Methyl-MAPS.  However,  similar  approaches 
could  be  used  to  analyze  the  relationships  between  ex¬ 
pression  and  other  epigenetic  patterns,  such  as 
5-hydroxymethylcytosine,  non-CpG  methylation,  and 
histone  modifications.  All  expression  data  used  in  this 
study  came  from  single-end  short  read  RNA-Seq  experi¬ 
ments.  If  long-read  paired-end  RNA-Seq  data  were 
available,  it  could  easily  be  used  with  our  method 
to  understand  alternate-promoter  and  isoform-specific 
methylation  patterns.  Considering  the  explosion  of  experi¬ 
ments  aiming  to  profile  epigenetic  landscapes,  this  method 
represents  a  valuable  tool  for  exploring  the  relationships 
between  changes  in  epigenetic  patterns  and  transcription 
both  in  normal  cellular  function  and  in  human  disease. 


SUPPLEMENTARY  DATA 

Supplementary  Data  are  available  at  NAR  Online: 
Supplementary  Tables  1-4,  Supplementary  Figures  1-24, 
Supplementary  Methods,  Supplementary  Data  1-3  and 
Supplementary  References  [29-32]. 

ACKNOWLEDGEMENTS 

The  authors  thank  Yu-Tseh  Chi  and  Tao  Ju  for  helpful 
discussions  about  curve  comparison  algorithms.  The 


authors  also  thank  Christopher  Schlosberg  for  helpful  dis¬ 
cussions  and  reading  of  this  manuscript. 

FUNDING 

National  Institutes  of  Health  [NIH  R00  CA127360,  NIH 
R21  LM011199  to  J.R.E.];  the  Department  of  Defense 
to  J.R.E.];  a  Siteman  Cancer  Center  Breast 
Cancer  Program  career  development  award  [to 
J.R.E.].  Funding  for  open  access  charge:  The  Internal 
Fund  from  the  Department  of  Medicine, 
Washington  University  in  St.  Louis. 

Conflict  of  interest  statement.  None  declared. 


REFERENCES 

1.  Laurent, L.,  Wong,E.,  Li,G.,  Huynh, T.,  Tsirigos,A.,  Ong,C.T., 
Low,H.M.,  Kin  Sung,K.W.,  Rigoutsos,I.,  Loring,J.  et  al.  (2010) 
Dynamic  changes  in  the  human  methylome  during  differentiation. 
Genome  Res.,  20,  320-331. 

2.  Ehrlich, M.  (2002)  DNA  methylation  in  cancer:  too  much,  but 
also  too  little.  Oncogene ,  21,  5400-5413. 

3.  Rakyan,V.K.,  Down,T.A.,  Balding, D.J.  and  Beck,S.  (2011) 
Epigenome-wide  association  studies  for  common  human  diseases. 
Nat.  Rev.  Genet.,  12,  529-541. 

4.  Saxonov,S.,  Berg,P.  and  Brutlag,D.L.  (2006)  A  genome-wide 
analysis  of  CpG  dinucleotides  in  the  human  genome  distinguishes 
two  distinct  classes  of  promoters.  Proc.  Natl  Acad.  Sci.  USA, 

103,  1412-1417. 

5.  Doi,A.,  Park, I. H.,  Wen,B.,  Murakami, P.,  Aryee,M.J.,  Irizarry, R., 
Herb,B.,  Ladd-Acosta,C.,  Rho,J.,  Loewer,S.  et  al.  (2009) 
Differential  methylation  of  tissue-  and  cancer- specific  CpG  island 
shores  distinguishes  human  induced  pluripotent  stem  cells, 
embryonic  stem  cells  and  fibroblasts.  Nat.  Genet.,  41,  1350-1353. 

6.  Hon,G.C.,  Hawkins, R.D.,  Caballero, O.L.,  Lo,C.,  Lister, R., 
Pelizzola,M.,  Valsesia,A.,  Ye,Z.,  Kuan,S.,  Edsall,L.E.  et  al.  (2012) 
Global  DNA  hypomethylation  coupled  to  repressive  chromatin 
domain  formation  and  gene  silencing  in  breast  cancer.  Genome 
Res.,  22,  246-258. 

7.  Hansen, K.D.,  Timp,W.,  Bravo, H.C.,  Sabunciyan,S.,  Langmead,B., 
McDonald, O.G.,  Wen,B.,  Wu,H.,  Liu,Y.,  Diep,D.  et  al.  (2011) 
Increased  methylation  variation  in  epigenetic  domains  across 
cancer  types.  Nat.  Genet.,  43,  768-775. 

8.  Berman, B.P.,  Weisenberger,D.J.,  Aman,J.F.,  Hinoue,T., 
Ramjan,Z.,  Liu,Y.,  Noushmehr,H.,  Lange, C.P.,  van  Dijk,C.M., 
Tollenaar,R.A.  et  al.  (2012)  Regions  of  focal  DNA 
hypermethylation  and  long-range  hypomethylation  in  colorectal 
cancer  coincide  with  nuclear  lamina-associated  domains.  Nat. 
Genet.,  44,  40-46. 

9.  Lister, R.,  Pelizzola,M.,  Dowen,R.H.,  Hawkins, R.D.,  Hon,G., 
Tonti-Filippini,J.,  Nery,J.R.,  Lee,L.,  Ye,Z.,  Ngo,Q.M.  et  al. 

(2009)  Human  DNA  methylomes  at  base  resolution  show 
widespread  epigenomic  differences.  Nature,  462,  315-322. 

10.  Edwards, J.R.,  O’Donnell, A. H.,  Rollins, R.A.,  Peckham,H.E., 
Lee,C.,  Milekic,M.H.,  Chanrion,B„  Fu,Y.,  Su,T.,  Hibshoosh,H. 
et  al.  (2010)  Chromatin  and  sequence  features  that  define  the  fine 
and  gross  structure  of  genomic  methylation  patterns.  Genome 
Res.,  20,  972-980. 

11.  Bock,C.  (2012)  Analysing  and  interpreting  DNA  methylation 
data.  Nat.  Rev.  Genet.,  13,  705-719. 

12.  Varley,K.E.,  Gertz,J.,  Bowling, K.M.,  Parker, S.L.,  Reddy, T.E., 
Pauli-Behn,F.,  Cross, M.K.,  Williams, B. A., 
Stamatoyannopoulos,J.A.,  Crawford, G.E.  et  al.  (2013)  Dynamic 
DNA  methylation  across  diverse  human  cell  lines  and  tissues. 
Genome  Res.,  23,  555-567. 

13.  Eckhardt,F.,  Lewin,J.,  Cortese,R.,  Rakyan,V.K.,  Attwood,J., 
Burger, M.,  Burton, J.,  Cox,T.V.,  Davies, R.,  Down, T. A.  et  al. 
(2006)  DNA  methylation  profiling  of  human  chromosomes  6,  20 
and  22.  Nat.  Genet.,  38,  1378-1385. 


Downloaded  from  http://nar.oxfordjournals.org/  at  Washington  University,  Law  School  Library  on  October  3,  2014 


Nucleic  Acids  Research,  2013,  Vol.  41,  No.  14  6827 


14.  Hansen, K.D.,  Langmead,B.  and  Irizarry, R.A.  (2012)  BSmooth: 
from  whole  genome  bisulfite  sequencing  reads  to  differentially 
methylated  regions.  Genome  Biol.,  13,  R83. 

15.  Alt, H.  and  Godau,M.  (1995)  Computing  the  Frechet  distance 
between  two  polygonal  curves.  Int.  J.  Comput.  Geom.  Appl.,  5, 
75-91. 

16.  Eiter,T.  and  Mannila,H.  (1994)  Computing  discrete  Frechet 
distance.  Tech.  Report  CS-TR-2008-0010.  University  of  Texas  at 
San  Antonio,  San  Antonio,  TX. 

17.  Aronov,B.,  Har-Peled,S.,  Knauer,C.,  Wang,Y.  and  Wenk,C. 

(2006)  In:  Proceedings  of  the  14th  conference  on  Annual  European 
Symposium  -  Volume  14.  Springer- Verlag,  Zurich,  Switzerland, 
pp.  52-63. 

18.  Rada-Iglesias,A.,  Bajpai,R.,  Swigut,T.,  Brugmann,S.A., 

Flynn, R.A.  and  Wysocka,J.  (2011)  A  unique  chromatin  signature 
uncovers  early  developmental  enhancers  in  humans.  Nature ,  470, 
279-283. 

19.  Lister, R.,  Pelizzola,M.,  Kida,Y.S.,  Hawkins, R.D.,  Nery,J.R., 
Hon,G.,  Antosiewicz-Bourget,J.,  0’Malley,R.,  Castanon,R., 
Klugman,S.  et  al.  (2011)  Hotspots  of  aberrant  epigenomic 
reprogramming  in  human  induced  pluripotent  stem  cells.  Nature , 
471,  68-73. 

20.  Kobayashi,H.,  Sakurai,T.,  Imai,M.,  Takahashi,N.,  Fukuda,A., 
Yayoi,0.,  Sato,S.,  Nakabayashi,K.,  Hata,K.,  Sotomaru,Y.  et  al. 
(2012)  Contribution  of  intragenic  DNA  methylation  in  mouse 
gametic  DNA  methylomes  to  establish  oocyte-specific  heritable 
marks.  PLoS  Genet.,  8,  el002440. 

21.  Majewski,J.  and  Ott,J.  (2002)  Distribution  and  characterization  of 
regulatory  elements  in  the  human  genome.  Genome  Res.,  12, 
1827-1836. 

22.  Appanah,R.,  Dickerson,D.R.,  Goyal,P.,  Groudine,M.  and 
Lorincz,M.C.  (2007)  An  unmethylated  3’  promoter-proximal 
region  is  required  for  efficient  transcription  initiation.  PLoS 
Genet.,  3,  e27. 

23.  Culhane,A.C.,  Schroder, M.S.,  Sultana, R.,  Picard, S.C., 
Martinelli,E.N.,  Kelly, C.,  Haibe-Kains,B.,  Kapushesky,M., 


St  Pierre, A. A.,  Flahive,W.  et  al.  (2012)  GeneSigDB:  a  manually 
curated  database  and  resource  for  analysis  of  gene  expression 
signatures.  Nucleic  Acids  Res.,  40,  D1060-D1066. 

24.  Das,R.,  Dimitrova, N.,  Xuan,Z.,  Rollins, R.A.,  Haghighi,F., 
Edwards,J.R.,  Ju,J.,  Bestor,T.H.  and  Zhang, M.Q.  (2006) 
Computational  prediction  of  methylation  status  in 
human  genomic  sequences.  Proc.  Natl  Acad.  Sci.  USA ,  103, 
10713-10716. 

25.  Bock,C.,  Paulsen, M.,  Tierling,S.,  Mikeska,T.,  Lengauer,T.  and 
Walter, J.  (2006)  CpG  island  methylation  in  human  lymphocytes  is 
highly  correlated  with  DNA  sequence,  repeats,  and  predicted 
DNA  structure.  PLoS  Genet.,  2,  e26. 

26.  Feltus,F.A.,  Lee,E.K.,  Costello, J.F.,  Plass,C.  and  Vertino,P.M. 
(2003)  Predicting  aberrant  CpG  island  methylation.  Proc.  Natl 
Acad.  Sci.  USA,  100,  12253-12258. 

27.  Meissner, A.,  Mikkelsen,T.S.,  Gu,H.,  Wernig,M.,  Hanna, J., 
Sivachenko,A.,  Zhang, X.,  Bernstein, B.E.,  Nusbaum,C.,  Jaffe,D.B. 
et  al.  (2008)  Genome-scale  DNA  methylation  maps  of  pluripotent 
and  differentiated  cells.  Nature,  454,  766-770. 

28.  Arvey,A.,  Agius,P.,  Noble, W.S.  and  Leslie, C.  (2012)  Sequence 
and  chromatin  determinants  of  cell-type-specific  transcription 
factor  binding.  Genome  Res.,  22,  1723-1734. 

29.  Rollins, R.A.,  Haghighi,F.,  Edwards, J.R.,  Das,R.,  Zhang, M.Q., 
Ju,J.  and  Bestor,T.H.  (2006)  Large-scale  structure  of  genomic 
methylation  patterns.  Genome  Res.,  16,  157-163. 

30.  Trapnell,C.,  Pachter,L.  and  Salzberg,S.L.  (2009)  TopHat: 
discovering  splice  junctions  with  RNA-Seq.  Bioinformatics,  25, 
1105-1111. 

31.  Robinson, M.D.,  McCarthy, D.J.  and  Smyth, G.K.  (2010)  edgeR:  a 
Bioconductor  package  for  differential  expression  analysis  of 
digital  gene  expression  data.  Bioinformatics,  26,  139-140. 

32.  Robinson, M.D.  and  Oshlack,A.  (2010)  A  scaling  normalization 
method  for  differential  expression  analysis  of  RNA-seq  data. 
Genome  Biol.,  11,  R25. 


Downloaded  from  http://nar.oxfordjournals.org/  at  Washington  University,  Law  School  Library  on  October  3,  2014 


